Coalescent-based species tree estimation: a stochastic Farris transform

Authors

  • Gautam Dasarathy
  • Elchanan Mossel
  • Robert D. Nowak
  • Sébastien Roch
Abstract

The reconstruction of a species phylogeny from genomic data faces two significant hurdles: 1) the trees describing the evolution of each individual gene (i.e., the gene trees) may differ from the species phylogeny, and 2) the molecular sequences corresponding to each gene often provide limited information about the gene trees themselves. In this paper we consider an approach to species tree reconstruction that addresses both of these hurdles. Specifically, we propose an algorithm for phylogeny reconstruction under the multispecies coalescent model with a standard model of site substitution. The multispecies coalescent is commonly used to model gene tree discordance due to incomplete lineage sorting, a well-studied population-genetic effect. In previous work, an information-theoretic trade-off was derived in this context between the number of loci, $m$, needed for an accurate reconstruction and the length of the locus sequences, $k$. It was shown that to reconstruct an internal branch of length $f$, one needs $m$ to be of the order of $1/(f^2 \sqrt{k})$. That previous result was obtained under the molecular clock assumption, i.e., under the assumption that mutation rates (as well as population sizes) are constant across the species phylogeny. Here we generalize this result beyond the restrictive molecular clock assumption, and obtain a new reconstruction algorithm that has the same data requirement (up to log factors). Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with $n \geq 3$ species, the rooted species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
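To make the stated trade-off concrete, the following minimal sketch evaluates the scaling 1/(f^2 √k) from the abstract for a few values of f and k. The function name `loci_scaling` and the constant `c` are assumptions introduced here for illustration; the abstract gives only the order of magnitude (up to log factors), so the outputs are meaningful as ratios, not as absolute locus counts.

```python
import math


def loci_scaling(f: float, k: int, c: float = 1.0) -> float:
    """Evaluate c / (f^2 * sqrt(k)).

    f: length of the internal branch to be reconstructed.
    k: length of each locus sequence.
    c: hypothetical constant (not given in the abstract; log factors ignored).
    """
    if f <= 0 or k <= 0:
        raise ValueError("f and k must be positive")
    return c / (f ** 2 * math.sqrt(k))


if __name__ == "__main__":
    base = loci_scaling(f=0.1, k=100)
    # Halving the branch length multiplies the requirement by 4.
    print(loci_scaling(f=0.05, k=100) / base)
    # Quadrupling the sequence length only halves the requirement.
    print(loci_scaling(f=0.1, k=400) / base)
```

Under this scaling, shortening the branch is far more costly than shortening the loci: the dependence on f is quadratic, while longer sequences help only through a square root.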

Similar resources

Genes with minimal phylogenetic information are problematic for coalescent analyses when gene tree estimation is biased.

The development and application of coalescent methods are undergoing rapid changes. One little explored area that bears on the application of gene-tree-based coalescent methods to species tree estimation is gene informativeness. Here, we investigate the accuracy of these coalescent methods when genes have minimal phylogenetic information, including the implementation of the multilocus bootstrap...

Phylogeny estimation of the radiation of western North American chipmunks (Tamias) in the face of introgression using reproductive protein genes.

The causes and consequences of rapid radiations are major unresolved issues in evolutionary biology. This is in part because phylogeny estimation is confounded by processes such as stochastic lineage sorting and hybridization. Because these processes are expected to be heterogeneous across the genome, comparison among marker classes may provide a means of disentangling these elements. Here we u...

Statistical binning enables an accurate coalescent-based estimation of the avian tree.

Gene tree incongruence arising from incomplete lineage sorting (ILS) can reduce the accuracy of concatenation-based estimations of species trees. Although coalescent-based species tree estimation methods can have good accuracy in the presence of ILS, they are sensitive to gene tree estimation error. We propose a pipeline that uses bootstrapping to evaluate whether two genes are likely to have t...

Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses

Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity....

Sources of error inherent in species-tree estimation: impact of mutational and coalescent effects on accuracy and implications for choosing among different methods.

Discord in the estimated gene trees among loci can be attributed to both the process of mutation and incomplete lineage sorting. Effectively modeling these two sources of variation--mutational and coalescent variance--provides two distinct challenges for phylogenetic studies. Despite extensive investigation on mutational models for gene-tree estimation over the past two decades and recent atten...

Journal:
  • CoRR

Volume abs/1707.04300

Publication date: 2017